It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this session, we will discuss the importance and the role of exploratory data analysis (EDA) and data visualisation techniques to find data quality issues and for data preparation, relevant to building ML pipelines. We will also discuss the latest advances in these fields and bring out areas that need innovation. Finally, we will discuss on the challenges posed by industry workloads and the gaps to be addressed to make data-centric AI real in industry settings.
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems
1. Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems
Hima Patel, Shanmukha Guttula, Ruhi Sharma Mittal, Naresh Manwani, Laure Berti-Equille, Abhijit Manatkar
2. Who are we
IBM Research, India
The International Institute of Information Technology Hyderabad, India
Institut de Recherche pour le Développement, France
Hima Patel, Shanmukha Guttula, Ruhi Sharma Mittal, Naresh Manwani, Abhijit Manatkar, Laure Berti-Equille
3. Hima Patel
The tutorial will be presented by:
Hima Patel (@hima_patel)
Senior Technical Staff Member; Research Manager, Data and Hybrid Platforms
IBM Research India
4. Tutorial Outline
• Part 1: Importance of Data Centric AI
• Part 2: Advances in exploratory data analysis
• Networking
• Part 3: Advances in data visualization techniques
• Part 4: Scalable data centric AI
• Open Discussion
The tutorial is planned to cover the main research challenges and ideas, and to discuss a few example papers to understand those ideas better. We will not cover all the papers and systems in each area.
6. Once upon a time..
"Yay!! I am so excited!!"
After many weeks…
"Still struggling with the data?"
7. Data preparation is one of the most time-consuming steps of the AI lifecycle
"Data collection and preparation are typically the most time-consuming activities in developing an AI-based application, much more so than selecting and tuning a model." – MIT Sloan Survey
https://sloanreview.mit.edu/projects/reshaping-business-with-artificial-intelligence/
"Data preparation accounts for about 80% of the work of data scientists" – Forbes
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#70d9599b6f63
8. Data preparation is also imperative for building AI models
Data preparation for AI is a foundational and critical step for building better and faster AI pipelines.
9. Broad components of data centric AI systems
Data Quality Analysis, Exploratory Data Analysis, Data Visualisation, Data Cleaning, Synthetic Data Generation, Data Labelling, …
10. Enterprise data centric AI systems are expected to..
• Work on large datasets (gigabytes, terabytes, ..)
• Handle data stored in multiple tables and in multiple sources
• Be compute aware
11. Data Quality for ML and Cleaning
Gupta et al., KDD 2021; Jain et al., KDD 2020
Data quality for ML spans tabular, unstructured, and spatio-temporal datasets.
Metrics to measure data quality for ML tasks: data cleaning, class imbalance, data valuation, data homogeneity, data transformation, label noise, class overlap, …
Select open source libraries:
Data Quality for AI: https://developer.ibm.com/apis/catalog/dataquality4ai--data-quality-for-ai/Introduction/
Tensorflow Data Validation: https://www.tensorflow.org/tfx/guide/tfdv
Pandas Profiler: https://github.com/pandas-profiling/pandas-profiling
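To make the metrics above concrete, here is a minimal pandas sketch of three of the checks such libraries automate (missing values, class imbalance, duplicate rows). The function name and report format are our own illustration, not the API of any library listed.

```python
# Hypothetical sketch of automated data-quality checks for ML.
# Not the API of Data Quality for AI, TFDV, or pandas-profiling.
import pandas as pd

def quality_report(df: pd.DataFrame, label_col: str) -> dict:
    """Compute a few simple quality metrics for an ML dataset."""
    report = {}
    # Fraction of missing values per column.
    report["missing_ratio"] = df.isna().mean().to_dict()
    # Class imbalance: ratio of smallest to largest class count.
    counts = df[label_col].value_counts()
    report["imbalance_ratio"] = counts.min() / counts.max()
    # Exact duplicate rows: a simple red flag for data collection issues.
    report["duplicate_rows"] = int(df.duplicated().sum())
    return report

df = pd.DataFrame({
    "age": [25, 31, None, 25],
    "label": ["yes", "no", "yes", "yes"],
})
report = quality_report(df, "label")
```

In practice the libraries above compute many more metrics (label noise, class overlap, data valuation) and scale beyond in-memory frames; this only shows the shape of the idea.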
12. In this tutorial, we will cover
Exploratory Data Analysis, Data Visualisation, Data Quality Analysis, Data Cleaning, Synthetic Data Generation, Data Labelling, …
Challenges associated with large scale datasets
13. Tutorial Outline
• Part 1: Importance of Data Centric AI
• Part 2: Advances in exploratory data analysis
• Part 3: Advances in data visualization techniques
• Part 4: Scalable data centric AI
• Open Discussion
15. Importance of EDA
Before making inferences on your data, it is necessary to examine and understand
all your variables.
Why?
● To discover trends and relationships present in the data
● To find violations of statistical assumptions
● To catch data quality issues
● To uncover the structure of your dataset
16. Challenges while performing EDA
● Manual EDA is cumbersome and time consuming.
● Requires profound analytical skills.
● Requires domain knowledge or access to a subject matter expert for the dataset.
● There are no standard steps; the process varies from data scientist to data scientist based on experience and skills.
To overcome these challenges, there has been a focus on automating EDA in the last few years.
17. Broad areas of research
1. Automatic Interactive Data Exploration Techniques
2. EDA by capturing and predicting user’s interest
3. End to end EDA Automation and explanations
19. Steps followed by a user for data exploration
“Manual” iterative exploration:
• Query formulation
• Query processing
• Result reviewing (and back to step 1)
Challenges:
• Ad-hoc queries: “correct” predicates are unknown a priori
• Labor intensive: thousands of objects to review
• Resource intensive: execution of long query sequences on big data
20. Automation ideas
● Exploration model
• Relies on user’s relevance feedback on data samples
• Eliminates query formulation step
• Navigates the user through the data space
• Reduces result reviewing overhead
● Performance goals
• Effectiveness
• Captures user interests with high accuracy
• Efficiency
• Minimizes reviewing effort and compute effort
• Offers interactive experience
21. Active Learning Based Interactive Database Exploration (AIDE) (Huang et al. 2018; Dimitriadou et al. 2016)
Picture credit: Dimitriadou et al. 2016
23. EDA by capturing and predicting user's interest
1. Automatic Interactive Data Exploration Techniques
2. EDA by capturing and predicting user's interest
3. End to end EDA Automation and explanations
24. Capturing user's interest
In interactive data exploration systems, a user's interest is captured via feedback on relevant samples.
However, a user's interest is:
- Subjective
- Able to change dynamically within the same session
- Contextual (based on what was seen previously)
- Possibly not captured by one mathematical expression (interestingness measure)
25. Interestingness Measures
Interestingness measures in the literature can be broadly grouped into the following buckets:
1. Diversity: displays whose elements demonstrate notable differences in values are ranked higher.
2. Dispersion: favors displays whose elements are relatively similar.
3. Peculiarity: a display is peculiar if it presents or contains anomalous patterns.
4. Conciseness: such measures consider the size of the display, i.e. the number of elements it contains. Displays that convey thousands of rows are difficult to interpret and are therefore considered less interesting.
Geng and Hamilton, 2006; McGarry, 2005.
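Two of these buckets can be sketched with simple formulas. The implementations below are our own simplifications (entropy for diversity, a size penalty for conciseness), not the specific measures of any cited paper.

```python
# Illustrative sketch of two interestingness-measure buckets.
# The formulas are our own simplifications, for intuition only.
import math
from collections import Counter

def diversity(values) -> float:
    """Shannon entropy of the value distribution: displays whose
    elements differ a lot score higher."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conciseness(values, max_rows: int = 1000) -> float:
    """Smaller displays are easier to interpret: score decays with size."""
    return max(0.0, 1.0 - len(values) / max_rows)

uniform_score = diversity(["a", "b", "c", "d"])   # maximal diversity
constant_score = diversity(["a", "a", "a", "a"])  # no diversity
```

A real system would combine several such scores (including dispersion and peculiarity) to rank candidate displays.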
27. Dynamic Interest Selection as Multiclass Classification (Milo et al. 2019)
1. Given EDA sessions, create training data with the following input-output pairs: the input is the current state of the EDA session and the output is the interestingness measure.
2. The interestingness measure can be found using the approach discussed just now.
3. Thus, each interestingness measure is treated as a class.
4. Train a multiclass classifier using the session logs.
5. At every step, dynamic interest selection is treated as a multiclass classification problem.
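The steps above can be sketched as follows. The feature encoding of the EDA state and the nearest-centroid classifier are our own toy stand-ins; the paper uses session logs and a trained multiclass model.

```python
# Toy sketch of steps 3-5: each interestingness measure is a class,
# predicted from the current EDA state. Features and model are invented.
import numpy as np

# Hypothetical training data: EDA-state feature vectors -> best measure.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array(["diversity", "peculiarity", "diversity", "peculiarity"])
y = np.array(["diversity", "diversity", "peculiarity", "peculiarity"])

# "Train": one centroid per class (a minimal multiclass classifier).
centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(state: np.ndarray) -> str:
    """Pick the measure whose class centroid is nearest to the state."""
    return min(centroids, key=lambda c: np.linalg.norm(state - centroids[c]))
```

At exploration time, `predict` would be called at every step so the recommended operation can be ranked by the measure that currently fits the user's interest.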
28. End to end EDA Automation and Explanations
1. Automatic Interactive Data Exploration Techniques
2. EDA by capturing and predicting user's interest
3. End to end EDA Automation and explanations
29. Fully Automated EDA
Fully automated EDA: given an input dataset, generate an entire EDA session which captures dataset highlights and interesting aspects.
Generated sessions should allow users to gain preliminary insights on their dataset, reducing manual effort and input.
30. ATENA: Deep RL Model for Fully Auto EDA (El et al. 2020)
Input: dataset → Output: EDA sessions for the full dataset
Use a deep reinforcement learning method to generate EDA sessions.
The main idea is to use interestingness measures as rewards.
31. ATENA: State and Action Spaces, Rewards
State Space: display dt is encoded to a numeric vector with the following features:
• Entropy, number of distinct values, and number of null values for each attribute.
• For each attribute, whether it is currently grouped/aggregated.
• Number of groups and the groups' size mean and variance.
• Display vectors of the three most recent operations in the session.
Action Space: FILTER(attr, op, term), GROUP(g_attr, agg_func, agg_attr), BACK()
32. ATENA: State and Action Spaces, Rewards
Rewards:
• Interestingness reward for group-by operations: promotes compact group-by results that cover many tuples, as these are both informative and easy to understand.
• Interestingness reward for filter operations: favors filter operations whose result display dt deviates significantly from the previous display dt−1.
• Diversity: encourages actions inducing new observations of different parts of the data than those examined thus far.
• Coherency: the sequence of operations is compelling and easy to follow.
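The filter reward can be illustrated with a distribution-deviation score between consecutive displays. The KL-divergence formulation with add-one smoothing below is our own stand-in, not ATENA's exact reward.

```python
# Sketch of a deviation-based filter reward: score a filter higher when
# the value distribution of the result display differs from the previous
# display. Our own KL-divergence formulation, not ATENA's exact reward.
import math
from collections import Counter

def deviation_reward(prev_values, curr_values) -> float:
    keys = set(prev_values) | set(curr_values)
    p, q = Counter(curr_values), Counter(prev_values)
    # Add-one smoothing so unseen values do not give infinite scores.
    n_p = len(curr_values) + len(keys)
    n_q = len(prev_values) + len(keys)
    kl = 0.0
    for k in keys:
        pk = (p[k] + 1) / n_p
        qk = (q[k] + 1) / n_q
        kl += pk * math.log2(pk / qk)
    return kl

# A filter that isolates rare "error" rows deviates strongly.
strong = deviation_reward(["ok"] * 9 + ["error"], ["error"] * 5)
weak = deviation_reward(["ok", "error"], ["ok", "error"])
```

An identical display yields a reward of zero; the more the filtered distribution departs from the previous one, the larger the reward.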
33. Balancing Familiarity and Curiosity in Data Exploration with Deep Reinforcement Learning (Personnaz et al. 2021)
Proposed solution:
• Modeled as an A3C DRL agent.
• The reward is defined as a function of familiarity and curiosity.
34. Auto Explanation of EDA Notebooks
EDA notebooks created by data scientists are often referred back to for performing similar analyses.
However, most of these EDA notebooks are not well documented, and an explanation of each view is missing.
For example, at each view, an algorithm could tell which element of the view is most interesting.
35. ExplainED: Explanations for EDA Notebooks (Deutch et al. 2020)
Challenges:
1. How to evaluate the interestingness of a view? Pick the interestingness measure from the list of possible measures that has the highest score for the given view.
2. How to show the most interesting part of the view? Find the part of the tuple that contributes most to the interestingness score via Shapley values (a similar idea to feature attribution).
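The Shapley attribution idea can be shown with an exact computation over a tiny "tuple". The score function below is a stand-in for an interestingness measure, not ExplainED's; with only a few attributes, exact enumeration over subsets is feasible.

```python
# Toy exact-Shapley computation: how much does each cell of a tuple
# contribute to an interestingness score? The score function is a
# stand-in, not ExplainED's actual measure.
import itertools
import math

def shapley(players, value) -> dict:
    """Exact Shapley values by enumerating all coalitions."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for s in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[p] += w * (value(set(s) | {p}) - value(set(s)))
    return phi

def score(subset) -> float:
    # Stand-in interestingness: only the anomalous cell matters.
    return 1.0 if "outlier_cell" in subset else 0.0

phi = shapley(["outlier_cell", "normal_cell"], score)
```

Here the anomalous cell receives the full credit for the score, which is exactly the signal an explanation layer would highlight to the notebook reader.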
36. Open Challenges
1. Can the rewards be made generic for any usecase? Can they be extended
to take care of operators specific to ML usecases (e.g. outliers, label
noise etc)
2. How to make the auto-generated sessions personalized, reactive to
users’ information needs?
3. How to build an effective, reproducible, experimental framework to
evaluate the quality of auto-generated sessions?
37. Summary
Three main areas:
Automatic Interactive Data Exploration Techniques
EDA by capturing and predicting user’s interest
End to end Automated EDA and explanations
Early work with deep learning systems and opportunity to expand
with more operators and generalization across usecases
38. Tutorial Outline
• Part 1: Importance of Data Centric AI
• Part 2: Advances in exploratory data analysis
• Part 3: Advances in data visualization techniques
• Part 4: Scalable data centric AI
• Open Discussion
40. Pipeline and Tools for Data Visualization
(Heer, 2022)
See also survey of (Qin et al., VLDB J., 2019)
(dos Santos et al., Computers & Graphics, 2004)
41. Main Challenges of Visualization Systems
● Accuracy
○ Reduce the impact of dirty data and show the uncertainties
● Usability
○ Integrate Human in the Loop
○ Be understood, interpreted, and trusted by humans
○ Ease/self-adapt the design, tuning, and use
● Efficiency
○ Runtime
○ Incremental
○ Progressive
42. Broad research areas
● Visualizations for data quality control
● Interactive visualization techniques
● Visualization recommendation techniques
44. Designing a Visual Analysis Pipeline for DQ Control:
Screening – Diagnosis – Correction
Adapted from Van den Broeck et al., 2005 by Liu et al., 2018
45. Visualization Tools for Data Quality Control
(Ward et al. 2008) proposed a methodology to
measure and expose: data quality, abstraction
quality, and visual quality.
Among many DQ-ware visualisation tools:
- DaVis (Sulo et al., 2005)
- TimeCleanser (Gschwandtner et al., 2014)
- VisPlause (Arbesser et al, 2017)
(Kandel et al., 2011)
46. Visplause for DQ checks Arbesser et al, IEEE Trans. VCG 2017
https://www.youtube.com/watch?v=5stVUf5CC3E
47. TimeCleanser for Time-oriented data cleansing
Gschwandtner et al., 2014
Time-oriented data quality checks with a set of corresponding visual artifacts
48. Open areas/questions
● As we move towards more AI usecases, visualization systems need to focus on data quality for ML issues along with existing checks.
50. Interactive Visualization Shen et al., IEEE TVCG 2022
Visualization-oriented Natural Language Interfaces (V-NLI)
● NL2VIS systems take NL
queries as inputs and
provide visualizations as
output.
● Fundamental challenges:
○ Query intent understanding
○ Data transformation
○ Visual Mapping
○ View transformations
○ Human-in-the-loop interactions
○ Dialogue management
51. ncNet (Luo et al., IEEE TVCG 2021)
ncNet: Natural Language to Visualization by Neural Machine Translation
52. Data-Debugging Through Interactive Visual Explanations (Afzal et al., 2021)
● Data readiness is an important module for ML pipelines.
● Certain remediations to the data (for example, fixing bad labels caused by labeling mistakes) need SME input and review.
55. Open areas and questions
● As we move towards more AI usecases, visualization systems need to focus on data quality for ML issues along with existing checks.
● V-NLI interfaces today support queries closer to usecases for deriving analytical insights. Can they support queries for AI usecases (for example, find all label-noise data points in the data)?
57. Importance of Visualization Recommendations
● Manual visualization:
○ Trial-and-error based process
○ Visual encoding: identify the appropriate type of visualization (charts, transformations)
○ Implementation: code the visualization
● Automated visualization recommendations: automatically recommend (type of graph, fields to be encoded) for a given dataset
○ Learn the visualization rules from data, experience, or user history
○ Incorporate data, visualization design context, user behavior, etc.
59. Voyager (Rule Based) Wongsuphasawat et al., TVCG 2016
(Figure: architecture and an example)
60. DeepEye (Hybrid) Luo et al., IEEE ICDE 2018
● DeepEye, an automatic data visualization system that tackles:
○ Visualization recognition: given a visualization, is it "good" or "bad"?
○ Visualization ranking: given two visualizations, which one is "better"?
○ Visualization selection: given a dataset, how to find the top-k visualizations?
61. VizML (ML based) Hu et al, ACM CHI 2019
● A Machine Learning Approach to Visualization Recommendation
62. Concluding Remarks
● Visual analytics offers efficient tools to help and engage the users in
data quality analysis and improvement
● Human in the loop still comes with multiple usability challenges
● The 4 Vs of Big Data
● There are many opportunities for:
○ Managing and orchestrating human/machine resources
○ Recommending features & impactful and accurate visualizations
○ Revisiting our frameworks and technologies to integrate adaptive
visual and interactive layers to ML black-boxes
63. Tutorial Outline
• Part 1: Importance of Data Centric AI
• Part 2: Advances in exploratory data analysis
• Part 3: Advances in data visualization techniques
• Part 4: Scalable data centric AI
• Open Discussion
66. Industry Challenges
● Growing data sizes: terabytes and petabytes of data
● How to conduct data quality checks?
● How to explore and visualize data efficiently?
● Compute considerations (also related to sustainability)
● Data is stored in different databases/sources: connectivity to different sources, different schemas, ..
67. Automating data quality for ML at scale
● Schelter, 2018 and Schelter, 2019 describe a system, built on Spark, that can perform unit tests on data; it was built and deployed at Amazon.
● Swami et al., ICDE 2020 describe "Data Sentinel", a declarative production-scale data validation platform, built and deployed at LinkedIn.
● Breck et al., SysML 2019 describe a data validation system designed to detect anomalies specifically in data fed to ML pipelines. It is part of TFX, an ML platform at Google.
68. Automating Large Scale Data Quality Verification (Schelter, 2018; Schelter, 2019)
Deequ: open source library
https://github.com/awslabs/deequ
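Deequ itself is a Scala/Spark library, so the snippet below is not its API; it is only a pandas sketch of the "unit tests for data" idea, with an invented mini fluent interface of declarative checks.

```python
# Pandas sketch of declarative "unit tests for data", in the spirit of
# Deequ. The Check class and its methods are invented for illustration;
# they are not Deequ's (Scala/Spark) API.
import pandas as pd

class Check:
    def __init__(self, df: pd.DataFrame):
        self.df, self.failures = df, []

    def is_complete(self, col):
        if self.df[col].isna().any():
            self.failures.append(f"{col} has nulls")
        return self

    def is_unique(self, col):
        if self.df[col].duplicated().any():
            self.failures.append(f"{col} has duplicates")
        return self

    def is_non_negative(self, col):
        if (self.df[col].dropna() < 0).any():
            self.failures.append(f"{col} has negative values")
        return self

df = pd.DataFrame({"id": [1, 2, 2], "price": [9.99, -1.0, 5.0]})
failures = Check(df).is_unique("id").is_non_negative("price").failures
```

The declarative style matters at scale: checks are specified once and the engine (Spark, in Deequ's case) decides how to compute them efficiently over large partitions.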
71. RASL: Relational Algebra in Scikit-Learn Pipelines (Sahni et al., 2021)
● One common practice is to use Spark for data preprocessing, using aggregation to reduce data size, followed by scikit-learn for machine learning in a separate pipeline.
● This paper suggests adding relational algebra operators (e.g. join, aggregates) to scikit-learn, such that these operators have the same scikit-learn syntax and semantics.
(Figure: visualization of the data preparation part using RASL)
Open source: https://github.com/ibm/lale
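The pattern RASL targets can be sketched end to end: a relational aggregation shrinks an event table to one row per entity, then a model is fit on the aggregated table. This uses plain pandas and numpy rather than RASL's actual operators; the column names and the least-squares stand-in for the ML step are ours.

```python
# Sketch of the aggregate-then-learn pattern that RASL unifies.
# Uses pandas + numpy as stand-ins; not RASL's actual operators.
import numpy as np
import pandas as pd

# Hypothetical raw event data (many rows per user).
events = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 5.0, 5.0],
    "churned": [0, 0, 1, 1, 1],
})

# Relational preprocessing: aggregate per user (the size-reducing step).
per_user = events.groupby("user").agg(
    total=("amount", "sum"),
    visits=("amount", "size"),
    churned=("churned", "max"),
).reset_index()

# Downstream ML step: least-squares fit on the small aggregated table.
X = per_user[["total", "visits"]].to_numpy(dtype=float)
y = per_user["churned"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
```

RASL's contribution is letting the aggregation step live inside the scikit-learn pipeline itself, so the whole flow can be trained, tuned, and deployed as one object instead of two disconnected systems.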
72. Conclusions
● Scalability to large datasets is critical for enterprise workloads.
● Some systems have been proposed that take advantage of architectures like Spark to process large datasets.
● Open areas remain on how to make these systems scalable for any data centric AI operation, such as detection of label noise.
73. In this tutorial, we have covered:
• Part 1: Importance of Data Centric AI
• Part 2: Advances in exploratory data analysis
• Part 3: Advances in data visualization techniques
• Part 4: Scalable data centric AI
• Open Discussion